Establishment and validation of multiclassification prediction models for pulmonary nodules based on machine learning

Abstract
Background Lung cancer is the leading cause of cancer-related death worldwide. This study aimed to establish novel multiclassification prediction models based on machine learning (ML) to predict the probability of malignancy in pulmonary nodules (PNs) and to compare them with three published models.
Methods Data on 914 patients with PNs were collected from four medical institutions (A, B, C and D) and organized into tables containing clinical features, radiologic features and laboratory test features. Patients were divided into benign lesion (BL), precursor lesion (PL) and malignant lesion (ML) groups according to pathological diagnosis. Approximately 80% of patients in A (total/male: 632/269, age: 57.73 ± 11.06) were randomly selected as a training set; the remaining 20% were used as an internal test set; and the patients in B (total/male: 94/53, age: 60.04 ± 11.22), C (total/male: 94/47, age: 59.30 ± 9.86) and D (total/male: 94/61, age: 62.0 ± 11.09) were used as an external validation set. Logistic regression (LR), decision tree (DT), random forest (RF) and support vector machine (SVM) were used to establish prediction models. Finally, the Mayo model, Peking University People's Hospital (PKUPH) model and Brock model were externally validated in our patients.
Results For the internal test set, the AUC values of the RF model for MLs, PLs and BLs were 0.80 (95% CI: 0.73–0.88), 0.90 (95% CI: 0.82–0.99) and 0.75 (95% CI: 0.67–0.88), respectively. The weighted average AUC value of the RF model for the external validation set was 0.71 (95% CI: 0.67–0.73), and its AUC values for MLs, PLs and BLs were 0.71 (95% CI: 0.68–0.79), 0.98 (95% CI: 0.88–1.07) and 0.68 (95% CI: 0.61–0.74), respectively. The AUC values of the Mayo model, PKUPH model and Brock model were 0.68 (95% CI: 0.62–0.74), 0.64 (95% CI: 0.58–0.70) and 0.57 (95% CI: 0.49–0.65), respectively.
Conclusions The RF model performed best, and its predictive performance was better than that of the three published models, which may provide a new noninvasive method for the risk assessment of PNs.


KEYWORDS machine learning (ML), prediction model, probability of malignancy, pulmonary nodules (PNs)

| INTRODUCTION
Lung cancer is the leading cause of cancer-related death worldwide. According to GLOBOCAN data, lung cancer caused approximately 1.8 million deaths in 2020, which constituted 18% of cancer-related deaths. 1 In 2015, there were approximately 0.73 million new cases of lung cancer and approximately 0.61 million deaths from lung cancer in China. 2 Due to population aging, environmental pollution and smoking, the incidence and mortality of lung cancer are expected to further increase. 3,4 Early detection, diagnosis and treatment are key to reducing mortality from lung cancer. 5 In recent years, thanks to the wide use of low-dose computed tomography (LDCT) screening, the detection rate of PNs has increased to 51%. 6 The National Lung Screening Trial (NLST) reported that LDCT screening reduced mortality from lung cancer by approximately 20% in a high-risk population compared with chest radiography. 7 However, more than 90% of these nodules are benign, which causes a high number of false-positive results. 6-8 Therefore, it is very important to find a method to accurately diagnose PNs.
In routine clinical practice, the preoperative diagnosis of PNs mainly depends on clinicians' experience; thus, the diagnosis results are greatly affected by subjective factors. In addition, biopsy is the gold standard for the diagnosis of PNs, but this method is invasive. In response to these problems, many researchers have established mathematical models to diagnose PNs in an objective and noninvasive way.
The Mayo model, 9 Brock model, 10 and Peking University People's Hospital (PKUPH) model 11 are the most commonly used and have been extensively validated. However, these models have some limitations. In the Mayo model, 12% of patients had unclear pathological diagnoses. The Brock model was only recommended for current and former smokers between 50 and 75 years of age without a history of lung cancer. The predictive features of these models were only clinical and imaging features, without laboratory test features. In addition, these models were based on only the logistic regression (LR) algorithm. ML algorithms are expected to provide novel methods for risk assessment of PNs. 13-15 Therefore, this study aimed to establish multiclassification ML models for the noninvasive prediction of PNs based on clinical features, imaging features and laboratory test data. We also evaluated the predictive performance of the Mayo model, Brock model and PKUPH model in our patients. We present the following article in accordance with the TRIPOD reporting checklist.

| Patients
This was a multicentre retrospective study. We recruited 914 patients with PNs from four medical institutions (A, B, C and D) in Chongqing between January 2013 and October 2021. Patients from hospital A (total/male: 632/269, age: 57.73 ± 11.06) were used as the development cohort and were randomly divided into a training set and an internal test set at a ratio of 8:2. Patients from hospitals B (total/male: 94/53, age: 60.04 ± 11.22), C (total/male: 94/47, age: 59.30 ± 9.86) and D (total/male: 94/61, age: 62.0 ± 11.09) were used as an external validation set. The training set was used for feature selection and model training. The internal test set and external validation set were used to validate the predictive performance of the models.
The inclusion criteria were as follows: (1) the maximal diameter of the nodule was less than 30 mm; (2) the diagnosis of the nodule was pathologically confirmed through operation or biopsy; and (3) the nodule was found to have been radiographically stable for at least 2 years. The exclusion criteria were as follows: (1) prior chemotherapy, radiotherapy or surgical treatment; (2) a history of thoracic cancer or extrathoracic malignant neoplasm within 5 years; (3) nodules with atelectasis or the presence of pleural effusion; and (4) a metastatic tumour. According to pathological diagnosis, all the patients were divided into the benign lesion (BL) group, the precursor lesion (PL) group and the malignant lesion (ML) group.

| Data collection and cleaning
Based on electronic medical records, we collected 79 clinical, radiologic and laboratory test features. In the development cohort, features with a missing rate greater than 20% were removed, and the missing values of features with a missing rate less than 20% were imputed with the mean or the mode. Finally, eight features were deleted because of missing data, and 71 features were included for analysis.
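The cleaning rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `clean_features`, the column names and the toy values are hypothetical, and it assumes that continuous features receive the mean, categorical features receive the mode, and a feature at exactly 20% missingness is kept (the text does not specify these details).

```python
import numpy as np
import pandas as pd

def clean_features(df: pd.DataFrame, categorical: list[str],
                   max_missing: float = 0.20) -> pd.DataFrame:
    """Drop features whose missing rate exceeds max_missing, then impute the rest."""
    missing_rate = df.isna().mean()                      # fraction missing per column
    kept = df.loc[:, missing_rate <= max_missing].copy()
    for col in kept.columns:
        if kept[col].isna().any():
            if col in categorical:
                fill = kept[col].mode(dropna=True).iloc[0]   # most frequent category
            else:
                fill = kept[col].mean()                      # mean of observed values
            kept[col] = kept[col].fillna(fill)
    return kept

# Toy data (hypothetical): one continuous, one categorical, one mostly-missing feature
demo = pd.DataFrame({
    "age": [55.0, 61.0, np.nan, 70.0, 48.0, 66.0],
    "nodule_type": ["solid", None, "solid", "part-solid", "solid", "ground-glass"],
    "rare_marker": [np.nan, np.nan, 1.2, np.nan, 0.8, np.nan],
})
cleaned = clean_features(demo, categorical=["nodule_type"])
print(cleaned)
```

In a train/test setting, the fill values would normally be computed on the training set only and reused for the test sets, so that no information from the evaluation data leaks into preprocessing.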

| Statistical analysis
Data processing involved the following steps. In the first step, univariate analysis was performed to compare the differences in each feature between groups. Continuous variables with a normal distribution are expressed as the mean ± standard deviation (SD); otherwise, they are expressed as the median and interquartile range. Categorical variables are reported as counts with percentages. Depending on normality and homogeneity of variance, group comparisons of continuous variables were analysed with the analysis of variance (ANOVA) test or the Kruskal-Wallis test. Group comparisons of categorical variables were analysed with the chi-square test or Fisher's exact test; p < 0.05 was considered statistically significant. In the second step, the statistically significant features in the univariate analysis were further screened by recursive feature elimination (RFE). In the third step, the selected features were incorporated into the ML models to establish the prediction models, including LR, decision tree (DT), RF and SVM. The accuracy, precision, recall, F1 score, receiver operating characteristic (ROC) curve and area under the curve (AUC) of the models were calculated. Because the outcome had three categories and the categories were imbalanced, we compared the predictive performance of the models by weighted average AUC. Finally, the predictive performance of the Mayo model, PKUPH model and Brock model was externally validated in our patients. The descriptions of these three models are shown in Table S1.
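To make the third step and the weighted average AUC concrete, the following sketch fits the four model families with scikit-learn and scores them one-vs-rest. It is an illustration under stated assumptions: the data are synthetic (`make_classification`), the hyperparameters are arbitrary defaults, and the helper `weighted_ovr_auc` is a hypothetical name; none of this is taken from the study. The weighted average is the per-class one-vs-rest AUC weighted by the number of true cases in each class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for eight features and three imbalanced classes (not study data)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_redundant=1, n_classes=3,
                           weights=[0.6, 0.1, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)   # 8:2 split

models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "DT": DecisionTreeClassifier(max_depth=4, random_state=0),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
}

weighted_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)                 # shape (n_samples, 3)
    weighted_auc[name] = roc_auc_score(
        y_test, proba, multi_class="ovr", average="weighted")

def weighted_ovr_auc(y_true, proba):
    """Per-class one-vs-rest AUC, averaged with class counts as weights."""
    classes, counts = np.unique(y_true, return_counts=True)
    per_class = [roc_auc_score(y_true == c, proba[:, i])
                 for i, c in enumerate(classes)]
    return float(np.average(per_class, weights=counts))

manual_rf = weighted_ovr_auc(y_test, models["RF"].predict_proba(X_test))
print(weighted_auc, manual_rf)
```

The manual helper reproduces what `average="weighted"` does, which shows why this metric suits imbalanced classes: a small class such as PLs contributes in proportion to its size rather than equally.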

| Ethical statement
The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). The study was approved by the Ethics Committee of the Third Affiliated Hospital of Chongqing Medical University (2022-41), and individual consent for this retrospective analysis was waived.

| Predictive feature selection
Predictive feature selection was based on the training set. Table 1 provides the results of the univariate analysis for all 71 features considered as potential predictors in our study. Thirty-one candidate predictive features with p < 0.05 were used as the input data for RFE. Finally, eight features were selected as the predictive features of this study, including age, maximum nodule diameter, nodule type, carcinoembryonic antigen (CEA), cytokeratin 19 fragment (CYFRA21-1), platelet large cell ratio (P-LCR), mean corpuscular haemoglobin concentration (MCHC) and percentage of monocytes (MONO%). The features of patients in the internal test set and external validation set are described in Table S3.
To enhance model interpretability, we ranked the importance of features. Figure 2K shows the ranking of feature importance for all features in the RF model. Nodule type and maximal diameter were the top two important features in the model, which made a great contribution to the prediction results.
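The selection and ranking steps described in this section can be sketched as follows. This is an illustrative outline on synthetic data, not the study's pipeline. It assumes a Kruskal-Wallis test for every feature (the study chose between ANOVA and Kruskal-Wallis according to the distribution), a random forest as the RFE base estimator (the text does not name the estimator), and impurity-based importances for the ranking.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in: 30 candidate features, three outcome classes (not study data)
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           n_redundant=4, n_classes=3, random_state=0)
classes = np.unique(y)

# Step 1: univariate screen, keeping features with p < 0.05 across the three groups
p_values = np.array([
    kruskal(*[X[y == k, j] for k in classes]).pvalue
    for j in range(X.shape[1])
])
candidates = np.flatnonzero(p_values < 0.05)

# Step 2: recursive feature elimination, dropping one feature per round
n_keep = min(8, candidates.size)
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=n_keep, step=1)
rfe.fit(X[:, candidates], y)
selected = candidates[rfe.support_]            # original column indices retained

# Step 3: rank retained features by the importance in the final fitted forest
importances = rfe.estimator_.feature_importances_
ranking = selected[np.argsort(importances)[::-1]]   # most important first
print("selected:", selected, "ranking:", ranking)
```

Both steps use only the training data, consistent with the statement that feature selection was based on the training set.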

| Validation of published models
According to the inclusion and exclusion criteria of each model, 302, 337 and 252 cases met the criteria of the Mayo model, PKUPH model and Brock model, respectively. The AUC values of the Mayo model, PKUPH model and Brock model were 0.68 (95% CI: 0.62-0.74), 0.64 (95% CI: 0.58-0.70) and 0.57 (95% CI: 0.49-0.65), respectively. Other validation results of these models are shown in Table 4 and Figure 2L.
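For readers who wish to reproduce this kind of validation, one common way to attach a 95% CI to a binary AUC is the percentile bootstrap, sketched below. The text does not state how the reported intervals were computed, so this method is an assumption rather than the authors' procedure; the function names and the synthetic scores are hypothetical.

```python
import numpy as np

def binary_auc(y_true, score):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    y_true = np.asarray(y_true, dtype=bool)
    score = np.asarray(score, dtype=float)
    pos, neg = score[y_true], score[~y_true]
    diff = pos[:, None] - neg[None, :]                 # all positive-negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size))

def bootstrap_auc_ci(y_true, score, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap confidence interval for the AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=bool)
    score = np.asarray(score, dtype=float)
    n = y_true.size
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)               # resample cases with replacement
        if y_true[idx].all() or not y_true[idx].any():
            continue                                   # AUC undefined without both classes
        stats.append(binary_auc(y_true[idx], score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return binary_auc(y_true, score), float(lo), float(hi)

# Synthetic example: 300 cases with a weakly informative risk score
rng = np.random.default_rng(1)
y_demo = rng.integers(0, 2, size=300)
score_demo = rng.normal(loc=0.6 * y_demo, scale=1.0)
auc, lo, hi = bootstrap_auc_ci(y_demo, score_demo)
print(f"AUC = {auc:.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```

In an external validation of a published model, `score` would be the malignancy probability returned by that model's formula for each eligible case.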

| DISCUSSION
In this study, we retrospectively collected data on 914 patients with PNs from four centres and established four multiclassification ML models to predict the probability of malignancy in PNs based on eight features. The stability and repeatability of the models were validated internally and externally. At the same time, we compared the predictive performance of our model with the Mayo model, PKUPH model and Brock model.
Age, maximum nodule diameter, nodule type, CEA, CYFRA21-1, P-LCR, MCHC and MONO% were selected as the predictive features in this study. 16-22 The risk of cancer is known to increase with age. In our study, malignant nodules were significantly larger than other nodules, which was consistent with previous studies. 23 PNs can be classified into solid, part-solid and pure ground-glass types according to nodule type. Studies have reported that the malignancy rate is significantly higher for part-solid nodules (63%) than for either solid nodules (7%) or pure ground-glass nodules (18%); thus, nodule type may be an effective predictor for PNs. 24 [...] 30,31 Notably, MCHC, P-LCR and MONO% were novel predictors identified in our study, which were rarely reported in other models. This may be because previous research did not take these features into account. Therefore, more studies are needed to confirm whether these features are truly useful in predicting PNs.
The Mayo model, 9 Brock model 10 and PKUPH model 11 established by the traditional LR algorithm have been widely recognized and validated. These models achieved a diagnostic accuracy of more than 80%, but their predictive performance for our patients was poor. The AUCs of the Mayo model and Brock model were only 0.68 and 0.57, respectively, which might be because the two models were established based on populations with a high proportion of benign PNs, while the proportion of malignant PNs in this study was relatively high. In addition, these two models were developed based on western populations and might poorly fit an eastern population. The PKUPH model was developed based on a Chinese population, but its AUC was only 0.64 for our patients.
The RF model had the best predictive performance in our study, with weighted average AUC values of 0.81 for the internal test set and 0.71 for the external validation set. RF is an ensemble learning method whose base classifier is DT. It adopts bootstrap resampling, which helps to reduce overfitting, and its majority voting method can effectively improve the accuracy of classification. Compared with traditional LR algorithms, RF shows better accuracy in dealing with large-scale and high-dimensional data analysis tasks.
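The two ingredients named above, bootstrap resampling and majority voting, can be illustrated with a hand-built ensemble of decision trees. This is a didactic sketch on synthetic data with hypothetical function names, not the model used in the study. Note also that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by a hard vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, n_trees=50, seed=0):
    """Fit each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))
        tree = DecisionTreeClassifier(
            max_features="sqrt",                       # random feature subset per split
            random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def majority_vote(trees, X, n_classes):
    """Return, for each sample, the class predicted by the most trees."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # (n_trees, n_samples)
    counts = np.apply_along_axis(np.bincount, 0, votes,
                                 minlength=n_classes)             # (n_classes, n_samples)
    return counts.argmax(axis=0)

# Synthetic three-class data (not study data); last 60 rows held out
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
trees = fit_bagged_trees(X[:240], y[:240])
pred = majority_vote(trees, X[240:], n_classes=3)
accuracy = float((pred == y[240:]).mean())
print(f"held-out accuracy of the voted ensemble: {accuracy:.2f}")
```

Because each tree sees a different resample and a random subset of features at each split, individual trees make partly independent errors, and the vote tends to be more stable than any single tree.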
Compared with previous studies, this study had the following advantages. (1) The published models were all based on the dichotomous task of distinguishing between BLs and MLs. According to the 5th edition of the World Health Organization (WHO) classification of thoracic tumours, lung tumours were classified as BLs, PLs and MLs, and atypical adenomatous hyperplasia (AAH) and adenocarcinoma in situ (AIS) were no longer classified as malignant tumours. 32 Based on this, we established multiclassification prediction models, which might improve management outcomes and realize precision medicine. (2) In addition to the traditional LR model, we established three other ML models and found that the RF model performed best, which might provide a new tool for improving the predictive performance for PNs. (3) In total, 71 candidate variables were analysed in our study, far more than in other studies. This helped to discover more potential risk factors related to the diagnosis of PNs. (4) Our model was developed based on multicentre data and validated internally and externally, indicating that the model had good stability and repeatability.
There were also several limitations in this study. (1) This was a retrospective analysis, and selection bias might have been present. (2) Some of the patients had incomplete data, which may have affected the results of our models. (3) The number of PLs in the external validation set was relatively small (five cases). (4) The study focused solely on samples that had pathological results. This limited the inclusiveness of the study and its applicability to the wider population. To enhance the effectiveness and generalizability of the model, future research should involve more prospective, large-sample, multicentre and diverse data to fully validate the model's efficacy. (5) Apart from RF, we did not build prediction models with other ensemble learning techniques, which may have limited the extent to which we could enhance the performance of the model. Future work will explore further ensemble models, such as other bagging and boosting methods, to potentially attain better predictive accuracy.

| CONCLUSIONS
In conclusion, we established four multiclassification ML models based on eight predictive features to predict the probability of malignancy of PNs. The RF model showed the best predictive performance; it is expected to replace traditional LR models and provide a new noninvasive tool for the early diagnosis of PNs.

FIGURE 2 Validation of models. (A-H) ROC curve and confusion matrix of the machine learning models for the internal test set. (I, J) ROC curve and confusion matrix of the random forest model for the external validation set. (K) Feature importance derived from random forest model. (L) ROC curve of the Mayo model, PKUPH model and Brock model for pulmonary nodules in our patients. 0, 1 and 2 represent malignant lesions, precursor lesions and benign lesions, respectively. PKUPH, Peking University People's Hospital; ROC curve, receiver operating characteristic curve.
TABLE 3 Statistics for machine learning models for the external validation set.
AUTHOR CONTRIBUTIONS
Conception and design: Q. Liu, X. Lv and Y. Zeng. Administrative support: All authors. Provision of study materials or patients: Q. Liu, X. Lv and D. Zhou. Collection and assembly of data: Q. Liu and X. Lv. Data analysis and interpretation: Q. Liu and X. Lv. Manuscript writing: All authors. Final approval of manuscript: All authors.
ACKNOWLEDGEMENTS
None.
TABLE 1 a Calculated using analysis of variance (ANOVA) test. b Calculated using Kruskal-Wallis test. c Calculated using chi-square test. d Calculated using Fisher's exact test.
TABLE 2
TABLE 4 Statistics for published models for our patients. Abbreviations: 95% CI, 95% confidence interval; AUC, area under the curve; PKUPH, Peking University People's Hospital.